4 Main Analysis (Exploratory Data Analysis)
In this report we take a macro-to-micro approach. We start with a brief overview of the whole league, then narrow down to compare team-specific performances. Originally, we chose all 31 teams and tried to analyze them together in our presentation. We then realised that plotting 31 teams in a single figure makes it impossible to interpret. We therefore decided to analyze the top 2 and bottom 2 teams in each conference. Finally, we zoom in on individual players. Occasionally we break this structure to give a better visual comparison between seemingly unrelated perspectives and to explain how they are related.
4.1 Overview of the whole league
4.1.1 Total number of games played vs number of wins
#number of games played vs number of wins
df1 = clutch[,c('GP','W','team')]
df1= gather(df1,type,count,-team)
#df1$count <- ifelse(df1$type =="W",df1$count*(-1),df1$count)
temp = df1[df1$type=='GP',]
new_levels= as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
#df1 <- within(df1, team <- factor(team, levels=names(sort(count, decreasing=TRUE))))
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
geom_bar(stat="identity",position="identity")+
scale_fill_manual(name="type of games",values = pal("five38"))+
coord_flip()+ggtitle("number of games played (GP) v.s number of wins (W)")+
geom_hline(yintercept=0)+
ylab("number of games")+
xlab("team name")+
scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
theme_scientific()
This plot shows the number of clutch games played and won by each team. Since clutch games played is always a superset of clutch wins, we draw the number of wins (in red) inside the total number of clutch games (in blue) to represent the ratio. From this simple plot, we can observe that WAS (Washington Wizards) played the largest number of clutch games: almost 2/3 of its games in the season reached clutch time. On the other hand, GSW (Golden State Warriors) played the fewest. Note that fewer clutch games does not necessarily mean a team is better, because it may simply lose games without ever reaching clutch time. However, in general a large number of clutch games shows that a team cannot put opponents away with an overwhelming advantage. Meanwhile, we usually expect a better team to be more cohesive and to win close games in clutch time, and the plot supports that. Last season's champion, GSW, has a very high win rate in clutch time, which means that in games where the two teams play evenly, GSW can usually take the win back to the Bay Area. In contrast, BKN (Brooklyn Nets), which ranked last in the league last season, loses almost all of its clutch games.
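The ratio that the nested bars convey visually can also be computed directly. The following is a minimal sketch, assuming a small hypothetical data frame `clutch_toy` with the same `GP` and `W` columns as `clutch`; the team labels and numbers are made up for illustration and are not taken from our data.

```r
# Hypothetical clutch records; illustrative values only, not the report's data.
clutch_toy <- data.frame(
  team = c("AAA", "BBB", "CCC"),
  GP   = c(50, 30, 40),   # games that reached clutch time
  W    = c(30, 24, 8),    # clutch games won
  stringsAsFactors = FALSE
)

# Wins are a subset of games played, so the ratio always lies in [0, 1].
clutch_toy$win_rate <- clutch_toy$W / clutch_toy$GP   # 0.6, 0.8, 0.2

# Sort from the best to the worst clutch win rate.
clutch_ranked <- clutch_toy[order(-clutch_toy$win_rate), ]
print(clutch_ranked)
```

Sorting by `win_rate` rather than by `GP` separates the two readings discussed above: how often a team ends up in clutch time, and how well it does once it is there.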
4.1.2 Personal fouls (PF) and turnovers (TOV)
df1 = clutch[,c('PF','TOV','team')]
df1= gather(df1,type,count,-team)
df1$count <- ifelse(df1$type =="PF",df1$count*(-1),df1$count)
temp = df1[df1$type=='TOV',]
new_levels= as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
#df1 <- within(df1, team <- factor(team, levels=names(sort(count, decreasing=TRUE))))
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
geom_bar(stat="identity",position="identity")+
scale_fill_manual(values = pal("five38"))+
coord_flip()+ggtitle("Personal fouls (PF) and turnovers (TOV)")+
geom_hline(yintercept=0)+
ylab("counts")+
xlab("team name")+
scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
theme_scientific()
We have already seen this graph above when discussing the rounding pattern. We bring it up again because of its relationship with the following plot on concentration in clutch time. In clutch time there are two things a team should avoid. Consider two scenarios. With 10 seconds left, your team is down 2 points but holds possession, so you have the last chance to tie or win the game; what if you turn the ball over and hand possession to the opponent? Or you lead by 1 point with 10 seconds left; what if you foul James Harden and give him two free throws (a potential 2 points)? So in clutch time, your coach will require you to avoid turnovers and fouls. As we said above, a better team should have fewer turnovers and fouls. From the plot we find that PHI (Philadelphia 76ers) committed the most turnovers and was among the teams with the most fouls in clutch time, which is consistent with its rank (second to last in the Eastern Conference); the same holds for SAC (Sacramento Kings, third to last in the Western Conference). Moreover, no team has the smallest number of both turnovers and fouls, which can be interpreted as there being no dominant team in terms of mistake control in clutch time.
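The "no dominant team" reading can be checked mechanically by asking whether the team with the fewest fouls is also the team with the fewest turnovers. The sketch below assumes a hypothetical data frame `fouls_toy` with `PF` and `TOV` columns shaped like those in `clutch`; the values are invented for illustration.

```r
# Hypothetical clutch-time fouls and turnovers; not the report's data.
fouls_toy <- data.frame(
  team = c("AAA", "BBB", "CCC", "DDD"),
  PF   = c(60, 45, 52, 70),
  TOV  = c(40, 48, 35, 55),
  stringsAsFactors = FALSE
)

# Team with the fewest personal fouls and team with the fewest turnovers.
fewest_pf  <- fouls_toy$team[which.min(fouls_toy$PF)]
fewest_tov <- fouls_toy$team[which.min(fouls_toy$TOV)]

# A single team leading both categories would dominate mistake control.
has_dominant_team <- fewest_pf == fewest_tov
print(c(fewest_pf = fewest_pf, fewest_tov = fewest_tov))
print(has_dominant_team)
```

Note that `which.min()` returns only the first minimum, so ties would need separate handling on real data.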
4.1.3 Diverging plot on points decomposition
df1 = clutch[,c('PCT_PTS_2PT','PCT_PTS_3PT','PCT_PTS_FT','team')]
df1= gather(df1,type,count,-team)
temp = df1[df1$type=='PCT_PTS_2PT',]
new_levels= as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
df1$count <- ifelse(df1$type =="PCT_PTS_2PT",df1$count*(-1),df1$count)
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
geom_col()+
scale_fill_manual(values = pal("five38"))+
coord_flip()+ggtitle("2PT%,3PT%,FT%")+
geom_hline(yintercept=0)+
ylab("percentage")+
xlab("team name")+
scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
theme_scientific()
This plot gives a direct visual presentation of how each team's points break down by shot type. It shows the proportion of points from 3-pointers, 2-pointers and free throws for each team, and the three proportions sum to 1. In clutch time, teams usually fall back on the tactics they know best, so the plot lets us read off each team's strategy, and defensive coaches can come up with corresponding defensive schemes. TOR (Toronto Raptors) has the largest proportion of points from 2-point field goals, which can be explained by its best player, DeMar DeRozan, being one of the best mid-range jump shooters. As a natural result, it has one of the lowest 3-point shares in the league, perhaps also because DeMar DeRozan did not shoot 3-pointers very often. Opponents can therefore leave TOR's players some space to shoot threes, but should focus on preventing TOR's shooters from getting closer to the basket. On the other hand, HOU (Houston Rockets), a team devoted to 3-point shooting, has a high 3-point share but a lower 2-point share. Its opponents can tolerate some mid-range shots but should definitely defend the three-point line harder.
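Two properties of this plot are easy to verify in code: the three shares sum to 1 for every team, and the diverging layout comes from flipping the sign of the 2PT share. The sketch below uses a hypothetical data frame `shares_toy` with the same column names as `clutch`; the numbers are illustrative only.

```r
# Hypothetical points decomposition; each row is one team's share of points.
shares_toy <- data.frame(
  team        = c("AAA", "BBB"),
  PCT_PTS_2PT = c(0.55, 0.40),
  PCT_PTS_3PT = c(0.25, 0.40),
  PCT_PTS_FT  = c(0.20, 0.20),
  stringsAsFactors = FALSE
)
share_cols <- c("PCT_PTS_2PT", "PCT_PTS_3PT", "PCT_PTS_FT")

# The three shares should add up to 1 (up to floating point error).
row_totals <- rowSums(shares_toy[, share_cols])
stopifnot(all(abs(row_totals - 1) < 1e-8))

# Diverging layout: negate the 2PT share so it is drawn to the left of zero,
# while the 3PT and FT shares stay on the right.
diverging <- shares_toy
diverging$PCT_PTS_2PT <- -diverging$PCT_PTS_2PT
print(diverging)
```

Because the sign flip is only a drawing device, the axis labels are passed through `abs()` in the plotting code so that readers still see positive percentages.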
4.1.4 Scatterplot on aggressiveness and defensiveness
library(png)
library(ggplot2)
library(gridGraphics)
library(ggimage)
path = 'https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/'
#img <- "https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/ATL.png?raw=true"
df1 = clutch[,c('OFF_RATING','DEF_RATING','team')]
df1$img = paste(path,df1$team,'.png?raw=true',sep='')
ggplot(df1,aes(x=OFF_RATING,y=DEF_RATING))+geom_point()+
scale_y_reverse()+geom_image(image = df1$img, size = .05)+
theme_scientific()+
xlab('offensive rating')+ylab('defensive rating')
In this part of the analysis, we look at how this plot interacts with the previous three plots.
The scatter plot shows how offensive or defensive each team is during clutch time. The exact definitions of offensive rating and defensive rating are quite complicated, so we omit them here. Intuitively, the higher a team's offensive rating, the more points it scores, while a lower defensive rating means opponents score fewer points. Because lower is better for defensive rating, the y-axis is reversed so that stronger defenses appear towards the top. We can observe that MIL is a very defensive team with a very low offensive rating. BOS (Boston Celtics) is a very offensive team with the highest offensive rating, while its defensive ability is not outstanding.
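For a rough intuition, ratings of this kind are commonly expressed as points per 100 possessions. The sketch below uses that simplified form only; the helper `rating_per_100` and all numbers are hypothetical, and the official definitions rely on a more careful possession estimate that we do not reproduce here.

```r
# Simplified rating: points scaled to 100 possessions.
# `rating_per_100` is a hypothetical helper, not part of any package.
rating_per_100 <- function(points, possessions) {
  100 * points / possessions
}

# Illustrative clutch-time totals for one team (made-up numbers).
points_scored  <- 55
points_allowed <- 52
possessions    <- 50

off_rating <- rating_per_100(points_scored, possessions)    # 110
def_rating <- rating_per_100(points_allowed, possessions)   # 104

# Positive net rating: the team outscores opponents per 100 possessions.
net_rating <- off_rating - def_rating                        # 6
print(c(off = off_rating, def = def_rating, net = net_rating))
```

This makes the asymmetry explicit: a higher offensive rating is better, while a lower defensive rating is better.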
Teams like OKC (Oklahoma City Thunder), SAS (San Antonio Spurs) and WAS (Washington Wizards) rate well on both scales. This indicates strong performance in both defence and offence, which can be seen as evidence of a strong team. This is further supported by our plot of clutch games played and won: WAS has the highest absolute number of wins, and OKC and SAS have win rates among the top 5.
We would expect an aggressive team to commit more personal fouls. However, comparing the plot on personal fouls with the offensive rating, there does not seem to be a direct relationship between them. Likewise, good defensive ability does not mean that a team will incur more personal fouls.
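One way to make "no direct relationship" checkable is a correlation coefficient between fouls and each rating. This is a sketch under stated assumptions: `teams_toy` is a hypothetical stand-in for the `PF`, `OFF_RATING` and `DEF_RATING` columns of `clutch`, and its values are invented, so the printed numbers say nothing about the real league.

```r
# Hypothetical team-level values; not taken from the clutch data set.
teams_toy <- data.frame(
  PF         = c(60, 45, 52, 70, 58, 49),
  OFF_RATING = c(110, 104, 112, 101, 108, 106),
  DEF_RATING = c(105, 109, 102, 111, 104, 107)
)

# Pearson correlation between fouls and each rating.
r_off <- cor(teams_toy$PF, teams_toy$OFF_RATING)
r_def <- cor(teams_toy$PF, teams_toy$DEF_RATING)

# cor.test() adds a p-value; with only a handful of teams it will rarely
# be conclusive, so it should be read as a sanity check rather than proof.
test_off <- cor.test(teams_toy$PF, teams_toy$OFF_RATING)
print(c(r_off = r_off, r_def = r_def, p_off = test_off$p.value))
```

A coefficient close to zero on the real data would support the visual impression, while a value near 1 or -1 would contradict it.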
The interaction between the points decomposition plot and the defense-offense plot is also very interesting. Is there a relation between a team's aggressiveness and the way it scores? As mentioned before, TOR has the highest percentage of 2PT and a very low percentage of 3PT, whereas CLE is the complete opposite. We can observe that TOR sits on the defensive side of the plot while CLE has a very high offensive rating. One potential explanation is that the 3-pointer is viewed as a much riskier and more offensive way of scoring than the much safer 2-pointer.
Meanwhile, the teams that perform poorly in both offense and defense, such as PHI, MIA, BKN and PHX, are all young teams (in terms of average player age), and their best players are usually under 22 years old. A plausible reading is that rookies cannot handle as much pressure as veterans, so in general they perform worse when the big moment comes.
4.1.5 Traditional measure on TSP VS PTS
# FGA (field goal attempts), recovered as makes divided by the matching percentage column
FGA = df_fgm$overall / df_fct$overall
# TSP (true shooting percentage) = PTS / (2 * (FGA + 0.44 * FTA))
TSP = df_pts$overall/(2*(FGA+0.44*df_fta$overall))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]
##==================================================================
#Plot on whole data, all teams
p_TSP = ggplot(df_pts_v1_2)+
geom_point(aes(overall,TSP,color = player_name),size = 1)+
facet_wrap(~Team_Name)+
labs(title = "TSP V.S PTS Facet on Team",x = 'Overall PTS', y='Overall TSP')
ggplotly(p_TSP)
TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs","Lakers","Suns","76ers","Nets")
TopLowP_TSP = df_pts_v1_2[df_pts_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
p_TSP = ggplot(TopLowP_TSP)+
geom_point(aes(X5min_plusminus_5,TSP,color = player_name,shape=Rank),size = 2)+
facet_wrap(~Team_Name)+
labs(title = "TSP VS PTS with 5mins+/-5pts",x = 'PTS', y='TSP')
ggplotly(p_TSP)
At the micro level, we can observe from the graph that on the top 4 teams the best players take over the game in clutch time: LeBron James for the Cavaliers, Kyrie Irving for the Celtics, Kawhi Leonard for the Spurs, and Stephen Curry and Kevin Durant for the Warriors. The reason may be that coaches usually trust their best players, who therefore take most of the shots. It is worth noting, however, that some other good players on those teams perform exceptionally well in clutch time, for example Kyle Korver on the Cavaliers and Danny Green on the Spurs; perhaps they should get more shots.
At the macro level, we can see that strong teams like the Celtics and Spurs have a very high true shooting percentage, the traditional measure of a team's shooting performance. Moreover, we saw earlier that the Spurs have a high 3PT ratio; that their true shooting percentage is nevertheless so high reflects the quality of the team's players and contributes to the team's good performance.
Hence, in this part we illustrated that we should not look at the traditional data or our data alone, but integrate them. The Spurs' true shooting percentage is good on its own; coupled with their high share of 3-point attempts and their aggressive style, it becomes even more valuable.
Moreover, if we look at TSP alone, the 76ers appear to perform quite decently. However, once we cross-reference it with their defensive strategy and high 2PT ratio, this figure is less convincing. This is one example of how we can integrate the traditional data and the alternative data.
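The TSP used in this section follows the formula TSP = PTS / (2 * (FGA + 0.44 * FTA)), with FGA recovered as makes divided by field goal percentage. The sketch below wraps both steps in two hypothetical helpers, `fga_from_pct` and `true_shooting`, and adds a guard for players without attempts; the input numbers are illustrative, and percentages are assumed to be stored as fractions.

```r
# Recover attempts from makes and a percentage stored as a fraction (0 to 1).
# Returns NA when the percentage is zero, since attempts cannot be recovered.
fga_from_pct <- function(fgm, fg_pct) {
  ifelse(fg_pct > 0, fgm / fg_pct, NA_real_)
}

# True shooting percentage; NA when the denominator is zero or unknown.
true_shooting <- function(pts, fga, fta) {
  denom <- 2 * (fga + 0.44 * fta)
  ifelse(denom > 0, pts / denom, NA_real_)
}

# Illustrative player: 8 makes at 50 percent, 5 free throw attempts, 22 points.
fga <- fga_from_pct(fgm = 8, fg_pct = 0.5)            # 16 attempts
tsp <- true_shooting(pts = 22, fga = fga, fta = 5)    # 22 / 36.4, about 0.604

# A player with no makes and a zero percentage yields NA instead of NaN or Inf.
tsp_empty <- true_shooting(pts = 0, fga = fga_from_pct(0, 0), fta = 0)
print(c(tsp = tsp, tsp_empty = tsp_empty))
```

Returning `NA` for these players matches how the analysis later drops rows with `!is.na(TSP)` before plotting.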
4.2 Team specific analysis
In this section, we zoom in on the top 2 and bottom 2 teams in each of the Eastern and Western conferences. Instead of analyzing the traditional team statistics, we choose to look at team performance in clutch time. Unlike in many other sports, the last few seconds of a basketball game can make a huge difference. Furthermore, compared to other sports, NBA players do not differ hugely in their performance in normal time. Clutch time, when every player is under the most pressure, is a true test of mental stability, stamina and skill. Differences in ability and performance are amplified in the final few seconds. Therefore, we believe that analyzing clutch time performance can give us great insight into the performance of a team.
4.2.1 3pcts vs 3fgm Facet on Top4Last4
#Plot on Top4 Last4
df_3pct['df_3fgm_overall']=df_3fgm$overall
df_pct3_v1 = df_3pct
df_pct3_v1_2 = df_pct3_v1[!is.na(df_3fgm$player_name),]
TopLowP_TSP = df_pct3_v1_2[df_pct3_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
p_3FGM3PCT = ggplot(TopLowP_TSP)+
geom_point(aes(df_3fgm_overall,overall,color = player_name,shape=Rank),size = 1)+
facet_wrap(~Team_Name)+
labs(title = "3pct_overall V.S 3fgm_overall",x = '3fgm_overall', y='3pct_overall')
ggplotly(p_3FGM3PCT)
This is a traditional method of analyzing team performance. The 3-pointer is an important way to score in basketball and has a major effect on the final result of a game. In line with our previous analysis, all top 4 teams have a very high 3-point percentage. The percentage is extremely high for the Spurs, which confirms our previous analysis.
4.2.2 Team Average Overall fgm
##==================================================================
#Plot on All team
df_all$Team_Name.x = as.factor(df_all$Team_Name.x)
countorder = df_all %>% group_by(Team_Name.x) %>% summarize(av=mean(overall.x, na.rm=TRUE))
#df_all = merge(df_fgm,df_pct,by = "player_id",all=TRUE)
ggplot(countorder, aes(reorder(Team_Name.x,av),av)) +
geom_col(color = "tomato", fill = "orange", alpha = .2)+
coord_flip()+
theme_scientific()+
labs(title = "Team Average Overall fgm",x = 'Team', y='Average Overall fgm')
##==================================================================
#Plot on Top4 Last4
TopLowP_TSP_1 = df_all[df_all$Team_Name.y %in% TopLowTeam,]
countorder = TopLowP_TSP_1 %>% group_by(Team_Name.x) %>% summarize(av=mean(overall.x, na.rm=TRUE))
countorder['Rank'] = ifelse(countorder$Team_Name.x %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
#countorder
countorder
ggplot(countorder, aes(reorder(Team_Name.x,av),av,fill = Rank)) +
geom_col()+
coord_flip()+
theme_scientific()+
labs(title = "Team Average Overall fgm",x = 'Team', y='Average Overall fgm')+
# the bars are mapped to fill, so the colorblind palette must be set on the fill scale
scale_fill_colorblind(name = "Rank")
Team average overall fgm is a very important traditional measure of team performance. We can observe that strong teams tend to have a higher fgm. The Spurs seem to be an outlier. However, if we combine this figure with our previous analysis of the Spurs' aggressiveness, their high 3-point ratio and their high success rate, the relatively low overall fgm is easy to understand. This is another example of how we can link various parts together to derive meaningful results.
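The team averages in this plot are per-player means within each team, ignoring missing values. The sketch below reproduces that aggregation in base R on a hypothetical data frame `fgm_toy`; the column names mirror ours, but the player rows and values are invented.

```r
# Hypothetical per-player field goals made; NA marks a player without data.
fgm_toy <- data.frame(
  Team_Name = c("Spurs", "Spurs", "Nets", "Nets", "Nets"),
  overall   = c(4, 6, 2, NA, 4),
  stringsAsFactors = FALSE
)

# Mean per team, dropping missing values inside each group.
team_avg <- tapply(fgm_toy$overall, fgm_toy$Team_Name, mean, na.rm = TRUE)

# Order teams from the lowest to the highest average, as in the bar chart.
team_avg_sorted <- sort(team_avg)
print(team_avg_sorted)
```

Because this is a mean over players, a team with many low-usage players is pulled down even if its starters score a lot, which is one reason a strong team can look like an outlier here.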
4.2.3 Parallel coordinates plot
# average within group 3point
cbP = c("#999999", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
df_3fgm_sum = aggregate(df_3fgm[,3:12], list(df_3fgm$Team_Name), sum, na.rm = TRUE)
deno = df_3fgm/df_3pct[,1:13]
deno$player_name = df_3fgm$player_name
deno$player_id = df_3fgm$player_id
deno$Team_Name = df_3fgm$Team_Name
deno_modi = aggregate(deno[,3:12], list(deno$Team_Name), sum, na.rm = TRUE)
average3point = df_3fgm_sum/deno_modi
average3point$Group.1=deno_modi$Group.1
average3point[is.na(average3point)] = 0
TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs",
"Lakers","Suns","76ers","Nets")
TopLow3point = average3point[average3point$Group.1 %in% TopLowTeam,]
RK = ifelse(TopLow3point$Group.1 %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
TopLow3point['TRk']= RK
#TopLow3point
p1 = ggparcoord(data = TopLow3point,
columns =2:7,
mapping=aes(color=as.factor(Group.1),
linetype = as.factor(TRk)),
scale = 'globalminmax'
)+
scale_linetype_discrete("Rank",
labels=TopLow3point$TRk)+
#scale_color_discrete("Team",
# labels=TopLow3point$Group.1)+
geom_vline(xintercept = 0:6, color = "lightblue")+
theme(axis.text.x=element_text(angle=90))+
labs(title = "Average 3PT Last Xmin yDown Top4 V.S Low4",x = 'Indicator', y='Team Average')+
scale_colour_colorblind("Team",
labels=TopLow3point$Group.1)
p2 = ggparcoord(data = TopLow3point,
columns =c(2,8:11),
mapping=aes(color=as.factor(Group.1),
linetype = as.factor(TRk)),
scale = 'globalminmax'
)+
scale_linetype_discrete("Rank",
labels=TopLow3point$TRk)+
#scale_color_discrete("Team",
# labels=TopLow3point$Group.1)+
geom_vline(xintercept = 0:6, color = "lightblue")+
theme(axis.text.x=element_text(angle=90))+
labs(title = "Average 3PT Last Xmin yDownOrHigher Top4 V.S Low4",x = 'Indicator', y='Team Average')+
scale_colour_colorblind("Team",
labels=TopLow3point$Group.1)
# average within group all point
cbP = c("#999999", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
df_fgm_sum = aggregate(df_fgm[,3:12], list(df_fgm$Team_Name), sum, na.rm = TRUE)
deno = df_fgm/df_pct[,1:13]
deno$player_name = df_fgm$player_name
deno$player_id = df_fgm$player_id
deno$Team_Name = df_fgm$Team_Name
deno_modi = aggregate(deno[,3:12], list(deno$Team_Name), sum, na.rm = TRUE)
averagepoint = df_fgm_sum/deno_modi
averagepoint$Group.1=deno_modi$Group.1
averagepoint[is.na(averagepoint)] = 0
TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs",
"Lakers","Suns","76ers","Nets")
TopLowpoint = averagepoint[averagepoint$Group.1 %in% TopLowTeam,]
RK = ifelse(TopLowpoint$Group.1 %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
TopLowpoint['TRk']= RK
#averagepoint
p3 = ggparcoord(data = TopLowpoint,
columns =2:7,
mapping=aes(color=as.factor(Group.1),
linetype = as.factor(TRk)),
scale = 'globalminmax'
)+
scale_linetype_discrete("Rank",
labels=TopLowpoint$TRk)+
#scale_color_discrete("Team",
# labels=TopLow3point$Group.1)+
geom_vline(xintercept = 0:6, color = "lightblue")+
theme(axis.text.x=element_text(angle=90))+
labs(title = "Average TotalPT Last Xmin yDown Top4 V.S Low4",x = 'Indicator', y='Team Average')+
scale_colour_colorblind("Team",
labels=TopLowpoint$Group.1)
p4 = ggparcoord(data = TopLowpoint,
columns =c(2,8:11),
mapping=aes(color=as.factor(Group.1),
linetype = as.factor(TRk)),
scale = 'globalminmax'
)+
scale_linetype_discrete("Rank",
labels=TopLowpoint$TRk)+
#scale_color_discrete("Team",
# labels=TopLow3point$Group.1)+
geom_vline(xintercept = 0:6, color = "lightblue")+
theme(axis.text.x=element_text(angle=90))+
labs(title = "Average TotalPT Last Xmin yDownOrHigher Top4 V.S Low4",x = 'Indicator', y='Team Average')+
scale_colour_colorblind("Team",
labels=TopLowpoint$Group.1)
grid.arrange(p1, p2, p3, p4, nrow = 2)
From this parallel coordinates plot we can observe that the traditional performance measures in clutch time fail to give us a good indication of team strength. This did not meet our expectation, because our original assumption was too naive: it ignored why clutch time happens in the first place. When a strong team enters clutch time, it is usually because its key players are in bad shape that day; otherwise the team would have decided the game earlier. That is why clutch time statistics fail to give us a good indication.
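The team averages behind these plots are pooled percentages: total makes divided by total attempts, with attempts recovered as makes divided by percentage. The sketch below shows that computation on a hypothetical data frame `shots_toy`, assuming percentages are stored as fractions; the values are illustrative only.

```r
# Hypothetical per-player makes and shooting percentages for one team.
shots_toy <- data.frame(
  Team_Name = c("AAA", "AAA", "AAA"),
  made      = c(9, 1, 0),
  pct       = c(0.45, 1.00, 0.00),
  stringsAsFactors = FALSE
)

# Attempts recovered from makes and percentage; 0 / 0 gives NaN.
shots_toy$attempts <- shots_toy$made / shots_toy$pct

# Pooled team percentage: sum of makes over sum of attempts.
# na.rm = TRUE also drops the NaN produced by players without makes.
team_pct <- sum(shots_toy$made) / sum(shots_toy$attempts, na.rm = TRUE)

# For comparison: the unweighted mean of player percentages.
mean_player_pct <- mean(shots_toy$pct)
print(c(team_pct = team_pct, mean_player_pct = mean_player_pct))
```

One caveat of this reconstruction is that a player who missed every shot has a percentage of zero, so his attempts cannot be recovered and drop out of the denominator; the pooled value can therefore be slightly optimistic.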
4.2.4 Further analysis on 30s clutch time
##==================================================================
#Plot on ALL
df_pct['df_fgm_overall']=df_fgm$overall
df_pct_v1 = df_pct
df_pct_v1_2 = df_pct_v1[!is.na(df_fgm$player_name),]
p_FGMPCT = ggplot(df_pct_v1_2)+
geom_point(aes(df_fgm_overall,overall,color = player_name),size = 1)+
facet_wrap(~Team_Name)+
labs(title = "pct_overall VS fgm_overall ",x = 'fgm', y='pct')
ggplotly(p_FGMPCT)
df_pct['df_fgm_overall']=df_fgm$X30sec_plusminus_5
df_pct_v1 = df_pct
df_pct_v1_2 = df_pct_v1[!is.na(df_fgm$player_name),]
TopLowP_TSP = df_pct_v1_2[df_pct_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
p_FGMPCT = ggplot(TopLowP_TSP)+
geom_point(aes(df_fgm_overall,X30sec_plusminus_5,color = player_name,shape=Rank),size = 2)+
facet_wrap(~Team_Name)+
labs(title = "pct VS fgm last 30sec+/-5pts",x = 'fgm', y='pct')
ggplotly(p_FGMPCT)
In this plot, we take a deeper look at the final 30 seconds when the score is tight. This situation differs from the one above, because in the last 30 seconds with the score within 5 points, anything can happen. This is the real clutch time. What stays the same is the common belief that in this situation the ball should go to the best players. The interesting case is the Warriors, last season's champion: their two best players, Kevin Durant and Stephen Curry, both have very low pct and fgm compared to their normal statistics. This is consistent with our previous analysis that when a strong team enters clutch time, its star players are usually not performing well that day. However, Shaun Livingston, a player with more than 10 years of NBA experience, seems more productive in the last 30 seconds of clutch time. The same can be found on the other top 4 teams, where veterans usually perform better, like Al Horford on the Celtics and Tony Parker on the Spurs: even though they are no longer among the best players on their teams, they can be the best in clutch time. Advice for coaches: give the ball to veterans and adjust your strategy based on the actual performance of the players on that day.
The colour for players may seem redundant. However, we use it because we want to display the player list on the right side, which lets us select players interactively. In the top4 and low4 plots, we additionally distinguish the top teams with a triangle and the bottom teams with a circle. The colour does not convey any meaning beyond providing the interactive list on the right-hand side. This convention is consistent across all the plots in our report, and we will not repeat it in the following parts. (We tried to fix the issue with the y-axis label by adjusting various parameters. It worked perfectly in the local file but failed in the HTML output. After researching online, this seems to be an issue on the plotting side.)
4.2.5 3pts average 10-second down figure plot (top4 down4)
##==================================================================
#Plot on All Teams
averagepoint=averagepoint[2:31,]
averagepoint['abbr'] = df_name_team_abbr[,1]
average3point=average3point[2:31,]
average3point['abbr'] = df_name_team_abbr[,1]
path = 'https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/'
averagepoint$img = paste(path,averagepoint$abbr,'.png?raw=true',sep='')
average3point$img = paste(path,average3point$abbr,'.png?raw=true',sep='')
##==================================================================
#Plot on Top4 Last4
TopLowP_TSP_1 = averagepoint[averagepoint$Group.1 %in% TopLowTeam,]
TopLowP_TSP_2 = average3point[average3point$Group.1 %in% TopLowTeam,]
# p3 uses averagepoint (all field goals), p4 uses average3point (3-pointers only)
p3 = ggplot(TopLowP_TSP_1,aes(overall,X10sec_down_3))+
geom_point()+
geom_image(image = TopLowP_TSP_1$img,
size = .05)+
theme_scientific()+
labs(title = "Total Average X10sec_down_3 v.s. Overall TopDown4",x = 'Overall', y='X10sec_down_3')
p4 = ggplot(TopLowP_TSP_2,aes(overall,X10sec_down_3))+
geom_point()+
geom_image(image = TopLowP_TSP_2$img,
size = .05)+
theme_scientific()+
labs(title = "3pt Average 10sec_down_3 v.s. Overall TopDown4",x = 'Overall', y='X10sec_down_3')
grid.arrange(p3, p4, nrow = 1)
Although the traditional measures in general fail to give us the result we are looking for, the 3pt average performance in the last 10 seconds is highly correlated with the ranking of the team. This figure plot gives a clear visual representation of the data. One potential reason is that strong teams usually have a deeper player pool, including a 3-point shooter designated for the final shot. This is why strong teams in general perform better in the last 10 seconds (even though their star players may not be in good shape, as we explained above).
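A claim of correlation with team ranking can be quantified with a rank correlation, which only uses the ordering of the teams. The sketch below is an assumption-laden illustration: `league_rank` and `pct_last10` are hypothetical vectors with invented values, not the values behind the figure.

```r
# Hypothetical league ranks (1 = best) and last-10-second 3pt averages.
league_rank <- 1:6
pct_last10  <- c(0.50, 0.45, 0.40, 0.42, 0.30, 0.20)

# Spearman's rho compares the two orderings; a strongly negative value
# means better-ranked teams (small rank number) shoot better late.
rho <- cor(league_rank, pct_last10, method = "spearman")
print(rho)
```

With only eight teams in the top4 and down4 comparison, such a coefficient is descriptive rather than a formal test, but it gives a single number to put next to the visual impression.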
4.3 Player specific analysis
For individual players, we mainly cover shooting patterns and miss rates. These are covered in detail in our interactive component.
4.4 Miscellaneous plots without significant discoveries
During our analysis, we looked at a large number of plots and explored many different aspects. However, we could not obtain meaningful patterns from some of them. We simply include them in this section to document the path we have taken.
4.4.1 TSP VS PTS All Star
# FGA (field goal attempts), recovered as makes divided by the matching percentage column
FGA = df_fgm$overall / df_fct$overall
# TSP (true shooting percentage) = PTS / (2 * (FGA + 0.44 * FTA))
TSP = df_pts$overall/(2*(FGA+0.44*df_fta$overall))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]
##==================================================================
#Plot on whole data, all teams
p_TSP_All = ggplot(df_pts_v1_2)+
geom_point(aes(overall,TSP,color = player_name,shape = Team_Name),size = 2)+
labs(title = "TSP V.S PTS All Star",x = 'Overall PTS', y='Overall TSP')
ggplotly(p_TSP_All)
4.4.2 TSP VS PTS on X5min_plusminus_5
# FGA (field goal attempts) in the last 5 minutes within 5 points: makes divided by the matching percentage column
FGA = df_fgm$X5min_plusminus_5 / df_fct$X5min_plusminus_5
# TSP (true shooting percentage) = PTS / (2 * (FGA + 0.44 * FTA))
TSP = df_pts$X5min_plusminus_5/(2*(FGA+0.44*df_fta$X5min_plusminus_5))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]
p_TSP_All = ggplot(df_pts_v1_2)+
geom_point(aes(X5min_plusminus_5,TSP,color = player_name,shape = Team_Name),size = 2)+
labs(title = "TSP VS PTS All Star",x = 'PTS', y='TSP')
ggplotly(p_TSP_All)
4.4.3 3pcts_overall VS 3fgm_overall
##==================================================================
#Plot on All Team
df_3pct['df_3fgm_overall']=df_3fgm$overall
df_pct3_v1 = df_3pct
df_pct3_v1_2 = df_pct3_v1[!is.na(df_3fgm$player_name),]
p_3FGM3PCT_All = ggplot(df_pct3_v1_2)+
geom_point(aes(df_3fgm_overall,overall,color = player_name,shape = Team_Name),size = 2)+
labs(title = "3pct_all V.S 3fgm_all ",x = '3fgm_overall', y='3pct_overall')
ggplotly(p_3FGM3PCT_All)
Although this plot does not carry valuable information for our main analysis, it has a very interesting pattern: if you select a region to zoom in, the X values are discrete rather than continuous. This is the same pattern we discussed before.
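The discreteness has a simple arithmetic origin: makes are whole numbers, and with few attempts only a handful of percentages are attainable. The sketch below enumerates every possible combination for up to four attempts; `max_attempts` is an arbitrary illustrative choice, not a property of our data.

```r
# Enumerate every possible (made, attempts) pair for a small number of attempts.
max_attempts <- 4
grid <- expand.grid(made = 0:max_attempts, attempts = 1:max_attempts)
grid <- grid[grid$made <= grid$attempts, ]

# Shooting percentage for each pair, as a fraction.
grid$pct <- grid$made / grid$attempts

# Only a few distinct percentages exist, which shows up as bands in the plot.
distinct_pct <- sort(unique(round(grid$pct, 6)))
print(nrow(grid))
print(distinct_pct)
```

Players with very few clutch attempts therefore line up on the same few values, while players with many attempts fill in the space between them.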
4.4.4 ftm_30sec_plusminus_5
##==================================================================
#Plot on All teams
df_fta['df_ftm_30sec_plusmiuns_5'] = df_ftm$X30sec_plusminus_5
df_fta_v1 = df_fta
df_fta_v1_2 = df_fta_v1[!is.na(df_fta$player_name),]
p_fta_ftm = ggplot(df_fta_v1_2)+
geom_point(aes(X30sec_plusminus_5,
df_ftm_30sec_plusmiuns_5,
color = player_name,
shape=Team_Name),
size = 1.3,
alpha=0.5,
position = "jitter")+
labs(title = "FTA VS FTM 30sec+/-5pts",x = 'fta', y='ftm')
ggplotly(p_fta_ftm)
This is the jittered version of our plots in section 3.2. However, we did not discover any pattern here.
4.4.5 1min_down5 plot
#Plot on Top4 Last4
TopLowP_TSP_1 = df_pct[df_pct$Team_Name %in% TopLowTeam,]
ggplot()+
geom_point(data =TopLowP_TSP_1,
aes(x = X1min_down_5, y= overall),
position = position_jitter(w = 0.01, h = 0.02),
alpha = 0.5,
size = 3)+
facet_wrap(~Team_Name)+
labs(title = "overall V.S X1min_down_5",
x = 'X1min_down_5',
y='overall')
4.4.6 pair plots
pairs(df_all[c("X10sec_down_3.x","X10sec_down_3.y","X30sec_down_3.x","X30sec_down_3.y")])
#df_all
pairs(df_all[c("X1min_down_5.x","X1min_down_5.y",
"X3min._down_5.x","X3min._down_5.y",
"X5min._down_5.x","X5min._down_5.y")])
#df_all
pairs(df_all[c("X30sec_plusminus_5.x","X30sec_plusminus_5.y",
"X1min_plusminus_5.x","X1min_plusminus_5.y",
"X3min_plusminus_5.x","X3min_plusminus_5.y")])